Around Inverse Reinforcement Learning and Score-based Classification
Abstract
Inverse reinforcement learning (IRL) aims to estimate an unknown reward function optimized by some expert agent, from interactions between this expert and the system to be controlled. One of its major application fields is imitation learning, where the goal is to imitate the expert, possibly in situations not encountered before. A classic and simple way to handle this problem is to cast it as a classification problem, mapping states to actions. The potential issue with this approach is that classification does not naturally take into account the temporal structure of sequential decision making. Yet, many classification algorithms work by learning a score function, mapping state-action couples to values, such that the value of the action chosen by the expert is higher than that of the other actions. The decision rule of the classifier maximizes the score over actions for a given state. This is curiously reminiscent of the state-action value function in reinforcement learning, and of the associated greedy policy. Based on this simple observation, we propose two IRL algorithms that incorporate the structure of the sequential decision-making problem into a classifier in different ways. The first one, SCIRL (Structured Classification for IRL), starts from the fact that linearly parameterizing a reward function by some features imposes a linear parameterization of the Q-function by a so-called feature expectation. SCIRL simply uses (an estimate of) the expert feature expectation as the basis function of the score function. The second algorithm, CSI (Cascaded Supervised IRL), applies a reversed Bellman equation (expressing the reward as a function of the Q-function) to the score function output by any score-based classifier, which reduces to a simple (and generic) regression step. Both algorithms come with theoretical guarantees and perform competitively on toy problems.
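As a quick illustration of the two constructions (the notation below is assumed here for concreteness, not fixed by the abstract): a reward linear in features φ makes the expert's Q-function linear in the feature expectation, and the Bellman equation read in reverse turns any score function q into a reward.

```latex
% SCIRL: a reward R_theta(s,a) = theta^T phi(s,a) implies a Q-function
% that is linear in the expert feature expectation mu:
\[
Q_\theta^{\pi_E}(s,a) = \theta^\top \mu^{\pi_E}(s,a),
\qquad
\mu^{\pi_E}(s,a) = \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^t \phi(s_t,a_t) \,\Big|\, s_0 = s,\ a_0 = a,\ \pi_E\Big].
\]
% CSI: the reversed Bellman equation recovers a reward from any score
% function q (the classifier's score plays the role of a Q-function):
\[
R(s,a) = q(s,a) - \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} q(s',a').
\]
```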
Similar references
Inverse Reinforcement Learning through Structured Classification
This paper addresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert ...
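To make the feature-expectation idea concrete, here is a minimal sketch of the Monte Carlo estimate it rests on, assuming expert trajectories given as lists of (state, action) pairs and a feature map `phi`; the function names and the backward-pass implementation are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of estimating the expert feature expectation from one
# demonstrated trajectory (an assumption-laden illustration).
import numpy as np

def tail_feature_expectations(traj, phi, gamma):
    """Monte Carlo estimate of mu(s_t, a_t) = sum_{k>=t} gamma^(k-t) phi(s_k, a_k)
    for every step t of one expert trajectory traj = [(s, a), ...]."""
    mus = []
    mu = np.zeros_like(phi(*traj[-1]))
    for (s, a) in reversed(traj):       # single backward sweep
        mu = phi(s, a) + gamma * mu     # discounted tail sum of features
        mus.append(mu)
    return list(reversed(mus))          # mus[t] estimates mu(s_t, a_t)

# These estimates then serve as basis functions of a linear score
# f_theta(s, a) = theta . mu_hat(s, a); training any multiclass
# classifier so the expert action scores highest yields weights theta
# that double as reward parameters, R_theta(s, a) = theta . phi(s, a).
```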
Preference-Based Reinforcement Learning: A preliminary survey
Preference-based reinforcement learning has gained significant popularity over the years, but it is still unclear what exactly preference learning is and how it relates to other reinforcement learning tasks. In this paper, we present a general definition of preferences as well as some insight into how these approaches compare to reinforcement learning, inverse reinforcement learning and other relate...
Adapting Reinforcement Learning to Tetris
This paper discusses the application of reinforcement learning to Tetris. Tetris and reinforcement learning are both introduced and defined, and relevant research is discussed. An agent based on existing research is implemented and investigated. A reduced representation of the Tetris state space is then developed, and several new agents are implemented around this state space. The implemented a...
A Cascaded Supervised Learning Approach to Inverse Reinforcement Learning
This paper considers the Inverse Reinforcement Learning (IRL) problem, that is inferring a reward function for which a demonstrated expert policy is optimal. We propose to break the IRL problem down into two generic Supervised Learning steps: this is the Cascaded Supervised IRL (CSI) approach. A classification step that defines a score function is followed by a regression step providing a rewar...
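A minimal sketch of the cascade, assuming finite actions encoded as integer labels and scikit-learn as the off-the-shelf classifier/regressor pair; the choice of LogisticRegression/LinearRegression and the dataset layout are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

def csi(S, A, S_next, n_actions, gamma):
    """Cascaded Supervised IRL sketch.
    S: (n, d) visited states; A: (n,) expert action labels in
    {0, ..., n_actions - 1} (assumed n_actions > 2, so that
    decision_function returns one score per action); S_next: (n, d)
    successor states observed in the demonstrations."""
    # Step 1 (classification): any score-based classifier works; its
    # per-action score q(s, a) stands in for a Q-function.
    clf = LogisticRegression(max_iter=1000).fit(S, A)
    q_sa = clf.decision_function(S)[np.arange(len(A)), A]
    q_next = clf.decision_function(S_next).max(axis=1)
    # Step 2 (regression): targets come from the reversed Bellman
    # equation, r(s, a) = q(s, a) - gamma * max_a' q(s', a').
    targets = q_sa - gamma * q_next
    X = np.hstack([S, np.eye(n_actions)[A]])   # crude (s, a) encoding
    return LinearRegression().fit(X, targets)  # reward model r(s, a)
```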
Score-based Inverse Reinforcement Learning
This paper reports theoretical and empirical results obtained for the score-based Inverse Reinforcement Learning (IRL) algorithm. It relies on a non-standard setting for IRL consisting of learning a reward from a set of globally scored trajectories. This allows using any type of policy (optimal or not) to generate trajectories without prior knowledge during data collection. This way, any existi...
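Under a linear-reward reading of this setting (an illustrative assumption, not necessarily the paper's estimator), fitting a reward to globally scored trajectories can be as simple as a least-squares problem:

```python
# Sketch: fit theta so that each trajectory's discounted feature sum
# matches its global score (all names here are hypothetical).
import numpy as np

def fit_reward_from_scores(trajectories, scores, phi, gamma):
    """trajectories: list of [(s, a), ...]; scores: one scalar per
    trajectory; phi(s, a) -> np.ndarray feature map."""
    X = np.array([
        sum(gamma**t * phi(s, a) for t, (s, a) in enumerate(traj))
        for traj in trajectories
    ])
    theta, *_ = np.linalg.lstsq(X, np.asarray(scores), rcond=None)
    return theta   # reward weights: R(s, a) = theta . phi(s, a)
```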